Primary Census Products
Decennial Census
American Community Survey (ACS)
Welcome! While we’re waiting:
Open RStudio
Open a new R script file
About me
Describe primary Census data products
Introduce R packages for working with Census Data
Use those packages to fetch census data
Use those packages to fetch census data plus census geograpic boundary files
Make maps of census data
The “nation’s leading provider of quality data about its people and economy.”
Available at www.census.gov
Decennial Census
American Community Survey (ACS)
Complete count of the population every 10 years since 1790
Includes data on
population, by age & race/ethnicity
housing, by occupancy & tenure (owned, rented)
Annual survey of a sample of about 3 million household
Provides estimates of demographic, social, economic & housing characteristics
Includes margin of error values for the estimates.
| Demographic* | Social | Economic | Housing |
|---|---|---|---|
| Sex | Families | Income | Tenure* |
| Age | Education | Benefits | Occupancy* |
| Race | Marital Status | Employment Status | Structure Type |
| Hispanic Origin | Fertility | Occupation | Housing Value |
| Grandparents | Industry | Taxes & Insurance | |
| Veterans | Commuting | Utilities | |
| Disability Status | Place of Work | Mortgage | |
| Language at Home | Health Insurance | Monthly Rent | |
| Citizenship | |||
| Mobility |
Census data are publicly available at one or more levels of geographic aggregation.
ACS 1 year and 5 year products are currently available
ACS 5 year data provdes much better estimates, lower margins of error
More data available for ACS 5 Year product
Identify your - topic of interest - year(s) - geographic level of detail - for what locations?
Then determine what specific tables and variables are available - ACS or Decennial?
“If you want to measure change you can’t change the measures!”
Census tables, variables, geographies, and geographic boundaries change over time!
Measuring change over time with census data is it’s own thing, complex and not covered by this workshop!
These are the ones we recommend and will use today.
Functions for accessing census decennial and ACS 5 year datasets via Census APIs
Census API keyLimited set of years available via tidycensus
Most recent ACS available is 2012-2016
tidycensus & update package.Provides access to census geographic data files
- detailed TIGER/Line boundary files (e.g., shapefiles), or
- simplified Cartographic boundary files
Also provides access to census feature data, - eg, rivers, roads, coastlands, landmarks, and more
Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.
tigris directly to access other census geographic data.Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.
Check out his website to keep abreast of his great packages, blog posts and tutorials.
Walker also develped a new DataCamp course Analyzing US Census Data in R!
You can write code to access the Census APIs directly.
You can download Census data directly from:
You can download Census geographic data directly on the census website
A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.
dplyr and tidyr for reshaping data
ggplot2 for plotting
purr, readr and tibble for improved performance
These packages are used by tidyverse under the hood. ## sf
Simple features for geospatial data objects and methods.
sp packagesf includes the functionality of the sp, rgdal, rgeos and proj4 packages.
We will work through several exercises using tidycensus to fetch, wrangle and map census data.
Load the packages we will use today
library(tidycensus)
library(tidyverse)
library(tigris)
library(sf)
library(dplyr)
Also install any dependancies.
# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("sf")
You need a census API key to programmatically fetch census data.
For more info see:
# Install your census api key.
# If you set install=TRUE you need only do this once.
#census_api_key("YOUR KEY GOES HERE", install = TRUE)
my_census_api_key <- "f2d6f4f743545d3a42a67412b05935dc7712c432"
census_api_key(my_census_api_key)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
to the folder in which you unzipped the workshop files, e.g.,
setwd("~/Documents/Dlab/workshops/2018/rCensus_workshop")
Let’s start by fetching population data for each state from the 2010 Census
First step is to identify the names of the census variables that contain the data of interest.
The tidycensus load_variables function can help with that.
First, take a look at the function documentation.
?load_variables
Use load_variables to fetch all variables used in the 2010 census into a dataframe.
vars2010 <- load_variables(year=2010, # Year or end year for ACS
dataset = 'sf1', # 'sf1' for decennial or 'acs5'
cache = TRUE) # Whether to save fetched data locally
Let’s take a look at and discuss the resultant dataframe.
View(vars2010)
Topics: Population, housing
333
get_decenial is the tidycensus function for fetching data from the decennial census API.
First, check the documentation for the function.
?get_decennial
Let’s fetch total population by state (P001001) from the 2010 census using get_decennial.
pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010) # census year
## Getting data from the 2010 decennial Census
How many rows and columns?
What column has the pop counts?
pop2010
## # A tibble: 52 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P001001 4779736.
## 2 02 Alaska P001001 710231.
## 3 04 Arizona P001001 6392017.
## 4 05 Arkansas P001001 2915918.
## 5 06 California P001001 37253956.
## 6 08 Colorado P001001 5029196.
## 7 09 Connecticut P001001 3574097.
## 8 10 Delaware P001001 897934.
## 9 11 District of Columbia P001001 601723.
## 10 12 Florida P001001 18801310.
## # ... with 42 more rows
ggplot2 is the most commonly used R package for data visualization.
It is loaded when you load the tidyverse package.
ggplot very useful for both exploratory analysis and communicating results.pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) +
geom_bar(stat="identity") + coord_flip() +
theme_minimal() +
labs(title = "2010 US Population by State") +
xlab("State") +
ylab("in millions")
Fetch population data by state for 2000.
Don’t assume variable names are the same across years. Check first!
# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)
# Take a look and search in the dataframe
View(vars2000)
# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)
# Take a look
pop2000
In the previous example we retrieved population data for all states.
This is the default behavior if you don’t specify a subset.
You can limit the retrieved data by state or by county, if applicable.
Let’s fetch data for just 3 states.
state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010, # census year
state=c("CA","OR","WA")) # Filter by states of interest
## Getting data from the 2010 decennial Census
state_pop2010
## # A tibble: 3 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06 California P001001 37253956.
## 2 41 Oregon P001001 3831074.
## 3 53 Washington P001001 6724540.
get_decennial accepts a number of different values for tabulation unit.
state, county, tract, block group, block, and ZCTA.Let’s change the tabulation unit to county.
co_pop2010 <- get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010)
## Getting data from the 2010 decennial Census
View the Data to see what was retrieved.
str(co_pop2010)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3221 obs. of 4 variables:
## $ GEOID : chr "01001" "01007" "01017" "01019" ...
## $ NAME : chr "Autauga County, Alabama" "Bibb County, Alabama" "Chambers County, Alabama" "Cherokee County, Alabama" ...
## $ variable: chr "P001001" "P001001" "P001001" "P001001" ...
## $ value : num 54571 22915 34215 25989 25833 ...
head(co_pop2010, 2)
## # A tibble: 2 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01001 Autauga County, Alabama P001001 54571.
## 2 01007 Bibb County, Alabama P001001 22915.
tail(co_pop2010, 2)
## # A tibble: 2 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 72151 Yabucoa Municipio, Puerto Rico P001001 37941.
## 2 72153 Yauco Municipio, Puerto Rico P001001 42043.
Fetch population by county for just California
Fetch population by county for Oregon & California
Fetch population by tract for all states.
Fetch population by tract for California.
If you want census data at the tract level or below you need to specifiy the state & county or counties.
tract_pop2010 <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year
state="CA", # limit to state of California
county=c("Alameda","Contra Costa")) # and only these counties
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
View the results! How many census tracts are in these 3 counties?
str(tract_pop2010)
## Classes 'tbl_df', 'tbl' and 'data.frame': 569 obs. of 4 variables:
## $ GEOID : chr "06001400100" "06001400200" "06001400300" "06001400400" ...
## $ NAME : chr "Census Tract 4001, Alameda County, California" "Census Tract 4002, Alameda County, California" "Census Tract 4003, Alameda County, California" "Census Tract 4004, Alameda County, California" ...
## $ variable: chr "P001001" "P001001" "P001001" "P001001" ...
## $ value : num 2937 1974 4865 3703 3517 ...
head(tract_pop2010, 2)
## # A tibble: 2 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, California P001001 2937.
## 2 06001400200 Census Tract 4002, Alameda County, California P001001 1974.
tail(tract_pop2010, 2)
## # A tibble: 2 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06013392300 Census Tract 3923, Contra Costa County, Cali… P001001 3102.
## 2 06013990000 Census Tract 9900, Contra Costa County, Cali… P001001 0.
Fetch population by county for Alameda County, California
You can use names or FIPs codes as your state and county function arguments.
# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa,
# San Mateo, Santa Clara, Solano, Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097", "087")
# Your code below...
bayarea_pop10 <-...
What three things are new here?
ur_pop10 <- get_decennial(geography = "county", # census tabulation unit
variables = c(urban="P002002",rural="P002005"),
year = 2010,
summary_var = "P002001", # Total Urban - the denominator
state='CA',
county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
ur_pop10
## # A tibble: 6 x 5
## GEOID NAME variable value summary_value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110. 87841.
## 2 06055 Napa County, California urban 118194. 136484.
## 3 06097 Sonoma County, California urban 424102. 483878.
## 4 06045 Mendocino County, California rural 39731. 87841.
## 5 06055 Napa County, California rural 18290. 136484.
## 6 06097 Sonoma County, California rural 59776. 483878.
The summary_value column comes in handy when you want to compute percent of total.
Here’s one way to do it.
ur_pop10 <- ur_pop10 %>%
mutate(pct = 100 * (value / summary_value))
## Warning: package 'bindrcpp' was built under R version 3.4.4
Let’s take a look at the output
ur_pop10 # Take a look
## # A tibble: 6 x 6
## GEOID NAME variable value summary_value pct
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110. 87841. 54.8
## 2 06055 Napa County, California urban 118194. 136484. 86.6
## 3 06097 Sonoma County, California urban 424102. 483878. 87.6
## 4 06045 Mendocino County, California rural 39731. 87841. 45.2
## 5 06055 Napa County, California rural 18290. 136484. 13.4
## 6 06097 Sonoma County, California rural 59776. 483878. 12.4
Plots give us compact visual summaries of the data
myplot <- ggplot(data = ur_pop10,
mapping = aes(x = NAME, fill = variable,
y = ifelse(test = variable == "urban",
yes = -pct, no = pct))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = abs, limits=c(-100,100)) +
labs(title="Urban & Rural Population in Wine Country",
x="County", y = " Percent of Population", fill="") +
coord_flip()
myplot
This is often helpful but you need to keep tract of the meaning of each variable.
alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
table = "P002", # Table of urban & rural population counts
year = 2010, # Decennial census year
state='CA', # Filter state
county="Alameda") # Filter county
## Getting data from the 2010 decennial Census
unique(alco_pop10$variable)
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
alco_pop10
## # A tibble: 2,166 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, Californ… P002001 2937.
## 2 06001400200 Census Tract 4002, Alameda County, Californ… P002001 1974.
## 3 06001400300 Census Tract 4003, Alameda County, Californ… P002001 4865.
## 4 06001400400 Census Tract 4004, Alameda County, Californ… P002001 3703.
## 5 06001400500 Census Tract 4005, Alameda County, Californ… P002001 3517.
## 6 06001400600 Census Tract 4006, Alameda County, Californ… P002001 1571.
## 7 06001400700 Census Tract 4007, Alameda County, Californ… P002001 4206.
## 8 06001400800 Census Tract 4008, Alameda County, Californ… P002001 3594.
## 9 06001400900 Census Tract 4009, Alameda County, Californ… P002001 2302.
## 10 06001401000 Census Tract 4010, Alameda County, Californ… P002001 5678.
## # ... with 2,156 more rows
You cannot do this with one tidycensus function call unless you wrap it in a function.
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)
pop2010 <- get_decennial(geography = "state", variables = "P001001", year = 2010)
Let’s try all three of these commands and then look at the ouput to see what’s different?
pop2010a <- get_decennial(geography = "state", variables = "P001001", year = 2010)
pop2010b <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010)
pop2010c <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2010, output="wide")
Your R skills can help you reformat the data and make it more useable.
First, fetch population data for 2010 & 2000 by state with output=wide.
Label the variables pop00 and pop10.
pop2000 <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2000, output="wide")
## Getting data from the 2000 decennial Census
pop2010 <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010, output="wide")
## Getting data from the 2010 decennial Census
pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
select(NAME, pop00, pop10)
View results
pop2000_2010
## NAME pop00 pop10
## 1 Alabama 4447100 4779736
## 2 Alaska 626932 710231
## 3 Arizona 5130632 6392017
## 4 Arkansas 2673400 2915918
## 5 California 33871648 37253956
## 6 Colorado 4301261 5029196
## 7 Connecticut 3405565 3574097
## 8 Delaware 783600 897934
## 9 District of Columbia 572059 601723
## 10 Florida 15982378 18801310
## 11 Georgia 8186453 9687653
## 12 Hawaii 1211537 1360301
## 13 Idaho 1293953 1567582
## 14 Illinois 12419293 12830632
## 15 Indiana 6080485 6483802
## 16 Iowa 2926324 3046355
## 17 Kansas 2688418 2853118
## 18 Kentucky 4041769 4339367
## 19 Louisiana 4468976 4533372
## 20 Maine 1274923 1328361
## 21 Maryland 5296486 5773552
## 22 Massachusetts 6349097 6547629
## 23 Michigan 9938444 9883640
## 24 Minnesota 4919479 5303925
## 25 Mississippi 2844658 2967297
## 26 Missouri 5595211 5988927
## 27 Montana 902195 989415
## 28 Nebraska 1711263 1826341
## 29 Nevada 1998257 2700551
## 30 New Hampshire 1235786 1316470
## 31 New Jersey 8414350 8791894
## 32 New Mexico 1819046 2059179
## 33 New York 18976457 19378102
## 34 North Carolina 8049313 9535483
## 35 North Dakota 642200 672591
## 36 Ohio 11353140 11536504
## 37 Oklahoma 3450654 3751351
## 38 Oregon 3421399 3831074
## 39 Pennsylvania 12281054 12702379
## 40 Rhode Island 1048319 1052567
## 41 South Carolina 4012012 4625364
## 42 South Dakota 754844 814180
## 43 Tennessee 5689283 6346105
## 44 Texas 20851820 25145561
## 45 Utah 2233169 2763885
## 46 Vermont 608827 625741
## 47 Virginia 7078515 8001024
## 48 Washington 5894121 6724540
## 49 West Virginia 1808344 1852994
## 50 Wisconsin 5363675 5686986
## 51 Wyoming 493782 563626
To check my data values
ggplot() +
geom_bar(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop10/1000000), stat="identity") + coord_flip() +
geom_point(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop00/1000000, col="red")) + coord_flip()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Use write.csv to save a data frame to a CSV file.
write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)
tidycensusWe can fetch geographic data with tidycensus by adding geometry=TRUE to our function calls.
Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.
Only a subset of data available via tigris can be accessed via tidycensus.
Before fetching geometry, we need to specify a few tigris options
class of returned data to be sf objects (not sp, the default)tigris_use_cache to TRUEcache directory (optional)# Tigris options - used by tidycensus
options(tigris_class = "sf") # SP is the default format returned by tigris
options(tigris_use_cache = TRUE) # Save retrieved data locally
#tigris_cache_dir("~/Documents/gis_data/census") # Folder for local data
We fetch the geospatial data by setting geometry=TRUE.
pop2010geo <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
year = 2010,
output="wide",
geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census
pop2010geo
## Simple feature collection with 52 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 52 x 4
## GEOID NAME pop10 geometry
## <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 01 Alabama 4779736. (((-85.00237 31.00068, -85.02411 …
## 2 02 Alaska 710231. (((-164.9762 54.13459, -164.9378 …
## 3 04 Arizona 6392017. (((-109.0452 36.99908, -109.0452 …
## 4 05 Arkansas 2915918. (((-94.55929 36.4995, -94.51948 3…
## 5 06 California 37253956. (((-122.4463 37.86105, -122.4386 …
## 6 08 Colorado 5029196. (((-102.0422 36.99308, -102.0545 …
## 7 09 Connecticut 3574097. (((-71.85957 41.3224, -71.86823 4…
## 8 10 Delaware 897934. (((-75.55945 39.62981, -75.5591 3…
## 9 11 District of Columbia 601723. (((-77.0386 38.79151, -77.0389 38…
## 10 12 Florida 18801310. (((-85.15641 29.67963, -85.1374 2…
## # ... with 42 more rows
Sys.getenv('TIGRIS_CACHE_DIR') # check cache dir
## [1] "/Users/pattyf/Documents/gis_data/census"
R sf objects include
geometry of type (MULTI) POINT, LINE, POLYGON
geometrya dataframe of attributes
CRS (coordinate reference system), specified by
We can use plot to make a quick map the geometry…
plot(pop2010geo$geometry)
What do you get if you plot the sf object without specifying “$geometry”
Almost all of the USA is between -66 and -180 degrees latitude.
But Alaska includes a few islands that are between 180-172 degrees latitude.
This makes it difficult to map the US centered at 0,0 lat/lon.
plot(pop2010geo$geometry)
tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.
pop2010geo_shifted <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
output="wide",
year = 2010,
geometry=TRUE,
shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.
plot(pop2010geo_shifted$geometry)
You can save sf data to a shapefile using st_write
st_write(pop2010geo_shifted,"usa_2010_shifted.shp")
plot(pop2010geo_shifted['pop10'])
ggplot(pop2010geo_shifted, aes(fill = pop10)) +
geom_sf()
ggplot(pop2010geo_shifted, aes(fill = pop10/1000000, color = pop10/1000000)) +
geom_sf() +
scale_fill_viridis_c() + scale_color_viridis_c(guide = FALSE) +
theme_minimal() +
labs(title = "Population by State, 2010", fill="Population (millions)")
Create a map of CA Population in 2010 by county
cal_pop10 <- get_decennial(geography = "county",
variables = "P001001",
year = 2010,
state='CA',
geometry=TRUE)
## Getting data from the 2010 decennial Census
plot(cal_pop10['value'])
We can fetch both the census data and geometry for more than one state.
west_pop10 <- get_decennial(geography = "county",
variables = "P001001",
year = 2010,
state=c('CA','OR','NV',"AZ"),
geometry=T)
## Getting data from the 2010 decennial Census
Fetching the data for all tracts in one state.
Need to specify county!.
# Optional state & county argments to get more limited geo subset
alco_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county='Alameda',
geometry=T)
## Getting data from the 2010 decennial Census
Fetch and map the 2010 population by census tract for Alameda and Countra Costa counties.
alcc_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county=c("Alameda","Contra Costa"),
geometry=T)
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
plot(alcc_pop10['value'])
Fetch and map the percent of San Francicso properties by census tract that were coded as rented in the 2010 Census.
Step 1 - indentify the variables for the
total number of hounsing units
number of renter occupied units
sf_rented <- get_decennial(geography = "tract", # census tabulation unit
variables = "H004004",
year = 2010,
summary_var = "H004001", # Total Urban - the denominator
state='CA',
county='San Francisco',
geometry=T)
sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
mutate(pct = 100 * (value / summary_value))
plot(sf_pct_rented['pct'])
You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2005 - 2010 dataset.
The default end year is the most recent one in tidycensus which is 2016 as of Dec 6, 2018.
get_acs syntax and options are very similar to get_decennial.
But there are many more data variables!
Let’s start by fetching ACS 5-year 2016 data on poverty.
We want to explore the number of folks living below the poverty level by census tract.
First we need to find the variable name(s)!
Load the ACS 2012-2016 5 year data variables into a dataframe.
end year in tidycensus!Then search for the variable names.
How many variables refer to the concept of poverty?
acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T)
#View(acs2016vars)
Many hundreds (thousands?) more than for decennial census!
See the documentation on the census website
Types of tables:
B prefix = base tablesC = collapsed tablesDP = data profilesS = Subject tablesACS variables can be confusing.
The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.
Let’s check it out to see what tables/variables we should use.
Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco
?get_acs
sf_poor <- get_acs(geography = "tract",
table = "C17002", # Table: ratio income to poverty
year = 2016,
state="CA",
summary_var = "C17002_001", # Est of num people - denom
county="San Francisco",
geometry=T)
## Getting data from the 2012-2016 5-year ACS
Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.
sf_poor
## Simple feature collection with 1576 features and 7 fields (with 8 geometries empty)
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -123.0139 ymin: 37.69274 xmax: -122.3276 ymax: 37.86334
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## GEOID NAME
## 1 06075010100 Census Tract 101, San Francisco County, California
## 2 06075010100 Census Tract 101, San Francisco County, California
## 3 06075010100 Census Tract 101, San Francisco County, California
## 4 06075010100 Census Tract 101, San Francisco County, California
## 5 06075010100 Census Tract 101, San Francisco County, California
## 6 06075010100 Census Tract 101, San Francisco County, California
## 7 06075010100 Census Tract 101, San Francisco County, California
## 8 06075010100 Census Tract 101, San Francisco County, California
## 9 06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
## variable estimate moe summary_est summary_moe
## 1 C17002_001 3972 291 3972 291
## 2 C17002_002 314 196 3972 291
## 3 C17002_003 397 185 3972 291
## 4 C17002_004 120 130 3972 291
## 5 C17002_005 236 129 3972 291
## 6 C17002_006 123 112 3972 291
## 7 C17002_007 444 304 3972 291
## 8 C17002_008 2338 387 3972 291
## 9 C17002_001 4300 442 4300 442
## 10 C17002_002 160 123 4300 442
## geometry
## 1 MULTIPOLYGON (((-122.4206 3...
## 2 MULTIPOLYGON (((-122.4206 3...
## 3 MULTIPOLYGON (((-122.4206 3...
## 4 MULTIPOLYGON (((-122.4206 3...
## 5 MULTIPOLYGON (((-122.4206 3...
## 6 MULTIPOLYGON (((-122.4206 3...
## 7 MULTIPOLYGON (((-122.4206 3...
## 8 MULTIPOLYGON (((-122.4206 3...
## 9 MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...
First remove census tracts that have no people!
sf_poor2 <- subset(sf_poor, summary_est > 0)
plot(sf_poor2['estimate'])
Let’s calculate the percent below poverty by tract.
sf_poor3 <- sf_poor2 %>%
mutate(pct = 100 * (estimate / summary_est))
sf_poor3
## Simple feature collection with 1560 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -122.5145 ymin: 37.70813 xmax: -122.3276 ymax: 37.86334
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## GEOID NAME
## 1 06075010100 Census Tract 101, San Francisco County, California
## 2 06075010100 Census Tract 101, San Francisco County, California
## 3 06075010100 Census Tract 101, San Francisco County, California
## 4 06075010100 Census Tract 101, San Francisco County, California
## 5 06075010100 Census Tract 101, San Francisco County, California
## 6 06075010100 Census Tract 101, San Francisco County, California
## 7 06075010100 Census Tract 101, San Francisco County, California
## 8 06075010100 Census Tract 101, San Francisco County, California
## 9 06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
## variable estimate moe summary_est summary_moe pct
## 1 C17002_001 3972 291 3972 291 100.000000
## 2 C17002_002 314 196 3972 291 7.905337
## 3 C17002_003 397 185 3972 291 9.994965
## 4 C17002_004 120 130 3972 291 3.021148
## 5 C17002_005 236 129 3972 291 5.941591
## 6 C17002_006 123 112 3972 291 3.096677
## 7 C17002_007 444 304 3972 291 11.178248
## 8 C17002_008 2338 387 3972 291 58.862034
## 9 C17002_001 4300 442 4300 442 100.000000
## 10 C17002_002 160 123 4300 442 3.720930
## geometry
## 1 MULTIPOLYGON (((-122.4206 3...
## 2 MULTIPOLYGON (((-122.4206 3...
## 3 MULTIPOLYGON (((-122.4206 3...
## 4 MULTIPOLYGON (((-122.4206 3...
## 5 MULTIPOLYGON (((-122.4206 3...
## 6 MULTIPOLYGON (((-122.4206 3...
## 7 MULTIPOLYGON (((-122.4206 3...
## 8 MULTIPOLYGON (((-122.4206 3...
## 9 MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...
We want to subset on and then combine the percents of people in each tract who are below 50% and 99% of the poverty line.
sf_poor4 <- subset(sf_poor3, (variable %in% c("C17002_002","C17002_003")))
head(sf_poor4)
## Simple feature collection with 6 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -122.4267 ymin: 37.79847 xmax: -122.3996 ymax: 37.81144
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## GEOID NAME
## 2 06075010100 Census Tract 101, San Francisco County, California
## 3 06075010100 Census Tract 101, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
## 11 06075010200 Census Tract 102, San Francisco County, California
## 18 06075010300 Census Tract 103, San Francisco County, California
## 19 06075010300 Census Tract 103, San Francisco County, California
## variable estimate moe summary_est summary_moe pct
## 2 C17002_002 314 196 3972 291 7.905337
## 3 C17002_003 397 185 3972 291 9.994965
## 10 C17002_002 160 123 4300 442 3.720930
## 11 C17002_003 75 55 4300 442 1.744186
## 18 C17002_002 194 102 4341 426 4.469016
## 19 C17002_003 431 204 4341 426 9.928588
## geometry
## 2 MULTIPOLYGON (((-122.4206 3...
## 3 MULTIPOLYGON (((-122.4206 3...
## 10 MULTIPOLYGON (((-122.4267 3...
## 11 MULTIPOLYGON (((-122.4267 3...
## 18 MULTIPOLYGON (((-122.4187 3...
## 19 MULTIPOLYGON (((-122.4187 3...
sf_poor5 <- sf_poor4 %>%
select(GEOID, pct, geometry) %>%
group_by(GEOID) %>%
summarise(pct_below_pov = sum(pct))
head(sf_poor5)
## Simple feature collection with 6 features and 2 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 3
## GEOID pct_below_pov geometry
## <chr> <dbl> <POLYGON [°]>
## 1 06075010100 17.9 ((-122.4206 37.81111, -122.4075 37.81144, -12…
## 2 06075010200 5.47 ((-122.4267 37.80964, -122.4249 37.8108, -122…
## 3 06075010300 14.4 ((-122.4187 37.80593, -122.4169 37.80521, -12…
## 4 06075010400 7.93 ((-122.4149 37.80354, -122.4116 37.80396, -12…
## 5 06075010500 8.53 ((-122.4052 37.80476, -122.4035 37.80509, -12…
## 6 06075010600 24.7 ((-122.411 37.80117, -122.4078 37.80157, -122…
Where are SF’s poorest areas?
plot(sf_poor5['pct_below_pov'])
tmapThe tmap package is great for making both static and interactive maps. It turns R into a GIS.
Let’s check it out with our last dataframe.
library(tmap)
## Warning: package 'tmap' was built under R version 3.4.4
tmap_mode("view") # set mode to interactive
## tmap mode set to interactive viewing
poverty_map <- tm_shape(sf_poor5) +
tm_polygons(col="pct_below_pov")
View the map - click on tracts
poverty_map
There are a number of great tutorials online for working with tmap.
See the References at the end of this workshop document.
Cartographic Boundary data vs Detailed TIGER/Line data
By default, tidycensus downloads census cartographic boundary data.
These are simplifed geometries, clipped to coastlines.
In get_acs you can also request the more detailed census TIGER data.
Let’s check it out.
sf_poor_cb <- get_acs(geography = "tract",
table = "C17002",
summary_var = "C17002_001",
year = 2016,
state="CA",
county="San Francisco",
geometry=TRUE,
cb = TRUE) # THIS IS THE DEFAULT!
## Getting data from the 2012-2016 5-year ACS
sf_poor_tl <- get_acs(geography = "tract",
table = "C17002",
summary_var = "C17002_001",
year = 2016,
state="CA",
county="San Francisco",
geometry=TRUE,
cb = FALSE) # Fetching the TIGER/Line data
## Getting data from the 2012-2016 5-year ACS
par(mfrow=c(1,2))
plot(sf_poor_cb[sf_poor$summary_est>0,]$geometry)
plot(sf_poor_tl[sf_poor$summary_est>0,]$geometry)
par(mfrow=c(1,1))
tm_shape(sf_poor_tl) +
tm_borders() +
tm_shape(sf_poor_cb) +
tm_borders(col="red")
TO BE ADDED….
Read in census variables of interest from a file
# Load cenvar lookup table of vars of interest
my_cenvar_df <-read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)
my_cenvar_df
## my_cen_var_names my_cen_vars
## 1 citizenship_totpop B05001_001E
## 2 citizenship_non_citizen B05001_006E
## 3 entry_totpop B05005_001E
## 4 entry_2010 B05005_002E
## 5 entry_2000_2009 B05005_007E
## 6 birthplace_totpop B05007_001E
## 7 birthplace_europ B05007_014E
## 8 birthplace_asian B05007_027E
## 9 birthplace_latinAmerica B05007_040E
## 10 birthplace_southAmerica B05007_081E
## 11 birthplace_other_nonUSA B05007_094E
## 12 birthplace_byage_totpop B06001_001E
## 13 birthplace_byage_fborn B06001_049E
## 14 poverty_totpop B06012_001E
## 15 below_pov B06012_002E
## 16 below_pov2 B06012_003E
## 17 poverty_fborn_totpop B06012_017E
## 18 below_pov_fborn B06012_018E
## 19 below_pov2_fborn B06012_019E
## 20 health_native_totpop B27020_002E
## 21 health_native_noinsurance B27020_006E
## 22 health_fborn_nat_totpop B27020_008E
## 23 fborn_nohealth_naturalized B27020_012E
## 24 health_fborn_noncit_totpop B27020_013E
## 25 fborn_nohealth_noncitizen B27020_017E
Fetch the ACS data for these variables for the 9 county bay area
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097", "087")
bay9_data <-get_acs(geography = "tract",
variables = my_cenvar_df$my_cen_vars,
year=2016,
state = "CA",
county = nine_counties,
geometry = T)
## Getting data from the 2012-2016 5-year ACS
bay9_data
## Simple feature collection with 41025 features and 5 fields (with 175 geometries empty)
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## GEOID NAME variable
## 1 06001400100 Census Tract 4001, Alameda County, California B05001_001
## 2 06001400100 Census Tract 4001, Alameda County, California B05001_006
## 3 06001400100 Census Tract 4001, Alameda County, California B05005_001
## 4 06001400100 Census Tract 4001, Alameda County, California B05005_002
## 5 06001400100 Census Tract 4001, Alameda County, California B05005_007
## 6 06001400100 Census Tract 4001, Alameda County, California B05007_001
## 7 06001400100 Census Tract 4001, Alameda County, California B05007_014
## 8 06001400100 Census Tract 4001, Alameda County, California B05007_027
## 9 06001400100 Census Tract 4001, Alameda County, California B05007_040
## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081
## estimate moe geometry
## 1 3018 195 MULTIPOLYGON (((-122.2469 3...
## 2 218 106 MULTIPOLYGON (((-122.2469 3...
## 3 944 168 MULTIPOLYGON (((-122.2469 3...
## 4 61 50 MULTIPOLYGON (((-122.2469 3...
## 5 176 99 MULTIPOLYGON (((-122.2469 3...
## 6 843 156 MULTIPOLYGON (((-122.2469 3...
## 7 208 75 MULTIPOLYGON (((-122.2469 3...
## 8 546 141 MULTIPOLYGON (((-122.2469 3...
## 9 46 43 MULTIPOLYGON (((-122.2469 3...
## 10 11 17 MULTIPOLYGON (((-122.2469 3...
bay9_data2 <- bay9_data %>%
select("GEOID", "variable", "estimate") %>%
spread(key=variable, value=estimate)
bay9_data2
## Simple feature collection with 1641 features and 26 fields (with 7 geometries empty)
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007
## 1 06001400100 3018 218 944 61 176
## 2 06001400200 1960 92 315 33 55
## 3 06001400300 5236 317 900 130 314
## 4 06001400400 4171 190 524 69 115
## 5 06001400500 3748 430 701 226 123
## 6 06001400600 1661 62 205 21 9
## 7 06001400700 4552 353 619 122 177
## 8 06001400800 3506 276 767 108 185
## 9 06001400900 2262 202 446 20 16
## 10 06001401000 6193 477 754 90 186
## B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094
## 1 843 208 546 46 11 43
## 2 243 65 136 28 19 14
## 3 857 119 90 257 86 391
## 4 471 146 228 35 15 62
## 5 635 160 83 245 37 147
## 6 178 49 74 38 0 17
## 7 587 70 145 299 67 73
## 8 741 181 330 149 8 81
## 9 405 20 136 148 38 101
## 10 671 56 46 519 67 50
## B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017
## 1 3018 843 3011 113 20 843
## 2 1960 243 1952 106 84 243
## 3 5236 857 5153 450 217 848
## 4 4171 471 4158 268 198 471
## 5 3748 635 3733 339 240 635
## 6 1661 178 1656 158 229 178
## 7 4552 587 4552 820 723 587
## 8 3506 741 3457 381 348 741
## 9 2262 405 2228 358 268 405
## 10 6193 671 6184 1466 768 671
## B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012
## 1 31 12 2175 88 625 12
## 2 31 22 1717 38 151 0
## 3 124 183 4379 111 540 34
## 4 52 82 3694 152 281 0
## 5 151 58 3113 243 205 6
## 6 59 21 1483 143 116 8
## 7 107 103 3965 512 234 33
## 8 75 191 2759 184 465 99
## 9 66 39 1857 103 203 15
## 10 120 17 5513 463 194 0
## B27020_013 B27020_017 geometry
## 1 218 10 MULTIPOLYGON (((-122.2469 3...
## 2 92 35 MULTIPOLYGON (((-122.2574 3...
## 3 317 15 MULTIPOLYGON (((-122.2642 3...
## 4 190 21 MULTIPOLYGON (((-122.2618 3...
## 5 430 148 MULTIPOLYGON (((-122.2694 3...
## 6 62 0 MULTIPOLYGON (((-122.2681 3...
## 7 353 147 MULTIPOLYGON (((-122.2779 3...
## 8 276 79 MULTIPOLYGON (((-122.2887 3...
## 9 202 28 MULTIPOLYGON (((-122.2856 3...
## 10 477 111 MULTIPOLYGON (((-122.2787 3...
Use our dataframe of census variables to rename the columns so that they are self-describing.
colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")
bay9_data2
## Simple feature collection with 1641 features and 26 fields (with 7 geometries empty)
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## GEOID citizenship_totpop citizenship_non_citizen entry_totpop
## 1 06001400100 3018 218 944
## 2 06001400200 1960 92 315
## 3 06001400300 5236 317 900
## 4 06001400400 4171 190 524
## 5 06001400500 3748 430 701
## 6 06001400600 1661 62 205
## 7 06001400700 4552 353 619
## 8 06001400800 3506 276 767
## 9 06001400900 2262 202 446
## 10 06001401000 6193 477 754
## entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ
## 1 61 176 843 208
## 2 33 55 243 65
## 3 130 314 857 119
## 4 69 115 471 146
## 5 226 123 635 160
## 6 21 9 178 49
## 7 122 177 587 70
## 8 108 185 741 181
## 9 20 16 405 20
## 10 90 186 671 56
## birthplace_asian birthplace_latinAmerica birthplace_southAmerica
## 1 546 46 11
## 2 136 28 19
## 3 90 257 86
## 4 228 35 15
## 5 83 245 37
## 6 74 38 0
## 7 145 299 67
## 8 330 149 8
## 9 136 148 38
## 10 46 519 67
## birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn
## 1 43 3018 843
## 2 14 1960 243
## 3 391 5236 857
## 4 62 4171 471
## 5 147 3748 635
## 6 17 1661 178
## 7 73 4552 587
## 8 81 3506 741
## 9 101 2262 405
## 10 50 6193 671
## poverty_totpop below_pov below_pov2 poverty_fborn_totpop
## 1 3011 113 20 843
## 2 1952 106 84 243
## 3 5153 450 217 848
## 4 4158 268 198 471
## 5 3733 339 240 635
## 6 1656 158 229 178
## 7 4552 820 723 587
## 8 3457 381 348 741
## 9 2228 358 268 405
## 10 6184 1466 768 671
## below_pov_fborn below_pov2_fborn health_native_totpop
## 1 31 12 2175
## 2 31 22 1717
## 3 124 183 4379
## 4 52 82 3694
## 5 151 58 3113
## 6 59 21 1483
## 7 107 103 3965
## 8 75 191 2759
## 9 66 39 1857
## 10 120 17 5513
## health_native_noinsurance health_fborn_nat_totpop
## 1 88 625
## 2 38 151
## 3 111 540
## 4 152 281
## 5 243 205
## 6 143 116
## 7 512 234
## 8 184 465
## 9 103 203
## 10 463 194
## fborn_nohealth_naturalized health_fborn_noncit_totpop
## 1 12 218
## 2 0 92
## 3 34 317
## 4 0 190
## 5 6 430
## 6 8 62
## 7 33 353
## 8 99 276
## 9 15 202
## 10 0 477
## fborn_nohealth_noncitizen geometry
## 1 10 MULTIPOLYGON (((-122.2469 3...
## 2 35 MULTIPOLYGON (((-122.2574 3...
## 3 15 MULTIPOLYGON (((-122.2642 3...
## 4 21 MULTIPOLYGON (((-122.2618 3...
## 5 148 MULTIPOLYGON (((-122.2694 3...
## 6 0 MULTIPOLYGON (((-122.2681 3...
## 7 147 MULTIPOLYGON (((-122.2779 3...
## 8 79 MULTIPOLYGON (((-122.2887 3...
## 9 28 MULTIPOLYGON (((-122.2856 3...
## 10 111 MULTIPOLYGON (((-122.2787 3...
Latest is 2012 - 2016
2013 - 2017 release date is dec 8, 2018
See: https://www.census.gov/programs-surveys/acs/news/data-releases/2017/release-schedule.html
See when tidycensus package is updated….!!!
Requires variable name to be the same!
# use purr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) {
get_decennial(geography = "state",
variables = c(totalpop = "P001001"),
dataset = "sf1",
year = x) %>%
mutate(year = x) }
)
# View output
head(pop90_10)
tail(pop90_10)
# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) +
geom_bar(stat="identity", position=position_dodge()) + coord_flip()
One of the strenghts of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.
Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometeries.
sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])
st_crs(sfnhoods)
st_crs(sf_poor5)
sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))
Reaggregate percent of people below poverty from census tract to neighborhood polygons.
sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods,
extensive = F) # True= aw sum; False= aw avg
par(mfrow=c(1,2))
plot(sf_poor5['pct_below_pov'])
plot(sfhoods2['pct_below_pov'])
par(mfrow=c(1,1))
tmaptm_shape(sfhoods2) +
tm_polygons(col="pct_below_pov")
head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov
# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
tm_polygons(col="pct_below_pov",
popup.vars = c("nhood", "pct_below_pov")
)